Gwen claims certain attributes are not relevant. See the spreadsheet she attached on Moodle. These attributes WILL NOT be filled in.
library(corrplot)
corrplot 0.92 loaded
library(ggplot2)
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
── Attaching packages ─────────────────────────────────────────────────────── tidyverse 1.3.2 ──✔ tibble 3.1.8 ✔ dplyr 1.0.10
✔ tidyr 1.2.1 ✔ stringr 1.4.1
✔ readr 2.1.2 ✔ forcats 0.5.2
✔ purrr 0.3.4 ── Conflicts ────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
library(fastDummies)
library(GGally)
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
Get rid of old stuff
rm(list=ls())
par(mfrow=c(1,1))
# simple R program to read csv file using read.table()
raw_data <- read.csv2("./LCdata.csv", header = TRUE, row.names=NULL, sep=";")
head(raw_data)
summary(raw_data)
id member_id loan_amnt funded_amnt funded_amnt_inv
Min. : 54734 Min. : 70473 Min. : 500 Min. : 500 Length:798641
1st Qu.: 9207230 1st Qu.:10877939 1st Qu.: 8000 1st Qu.: 8000 Class :character
Median :34433372 Median :37095300 Median :13000 Median :13000 Mode :character
Mean :32463636 Mean :35000265 Mean :14754 Mean :14741
3rd Qu.:54900100 3rd Qu.:58470266 3rd Qu.:20000 3rd Qu.:20000
Max. :68617057 Max. :73544841 Max. :35000 Max. :35000
term int_rate installment emp_title emp_length
Length:798641 Length:798641 Length:798641 Length:798641 Length:798641
Class :character Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character
home_ownership annual_inc verification_status issue_d
Length:798641 Length:798641 Length:798641 Length:798641
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
loan_status pymnt_plan url desc purpose
Length:798641 Length:798641 Length:798641 Length:798641 Length:798641
Class :character Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character
title zip_code addr_state dti delinq_2yrs
Length:798641 Length:798641 Length:798641 Length:798641 Min. : 0.0000
Class :character Class :character Class :character Class :character 1st Qu.: 0.0000
Mode :character Mode :character Mode :character Mode :character Median : 0.0000
Mean : 0.3145
3rd Qu.: 0.0000
Max. :39.0000
NA's :25
earliest_cr_line inq_last_6mths mths_since_last_delinq mths_since_last_record
Length:798641 Min. : 0.0000 Min. : 0.0 Min. : 0.0
Class :character 1st Qu.: 0.0000 1st Qu.: 15.0 1st Qu.: 51.0
Mode :character Median : 0.0000 Median : 31.0 Median : 70.0
Mean : 0.6947 Mean : 34.1 Mean : 70.1
3rd Qu.: 1.0000 3rd Qu.: 50.0 3rd Qu.: 92.0
Max. :33.0000 Max. :188.0 Max. :129.0
NA's :25 NA's :408818 NA's :675190
open_acc pub_rec revol_bal revol_util total_acc
Min. : 0.00 Min. : 0.0000 Min. : 0 Length:798641 Min. : 1.00
1st Qu.: 8.00 1st Qu.: 0.0000 1st Qu.: 6443 Class :character 1st Qu.: 17.00
Median :11.00 Median : 0.0000 Median : 11876 Mode :character Median : 24.00
Mean :11.55 Mean : 0.1953 Mean : 16930 Mean : 25.27
3rd Qu.:14.00 3rd Qu.: 0.0000 3rd Qu.: 20839 3rd Qu.: 32.00
Max. :90.00 Max. :63.0000 Max. :2904836 Max. :169.00
NA's :25 NA's :25 NA's :2 NA's :25
initial_list_status out_prncp out_prncp_inv total_pymnt
Length:798641 Length:798641 Length:798641 Length:798641
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
total_pymnt_inv total_rec_prncp total_rec_int total_rec_late_fee recoveries
Length:798641 Length:798641 Length:798641 Length:798641 Length:798641
Class :character Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character
collection_recovery_fee last_pymnt_d last_pymnt_amnt next_pymnt_d
Length:798641 Length:798641 Length:798641 Length:798641
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
last_credit_pull_d collections_12_mths_ex_med mths_since_last_major_derog policy_code
Length:798641 Min. : 0.00000 Min. : 0.0 Min. :1
Class :character 1st Qu.: 0.00000 1st Qu.: 27.0 1st Qu.:1
Mode :character Median : 0.00000 Median : 44.0 Median :1
Mean : 0.01447 Mean : 44.1 Mean :1
3rd Qu.: 0.00000 3rd Qu.: 61.0 3rd Qu.:1
Max. :20.00000 Max. :188.0 Max. :1
NA's :126 NA's :599107
application_type annual_inc_joint dti_joint verification_status_joint
Length:798641 Length:798641 Length:798641 Length:798641
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
acc_now_delinq tot_coll_amt tot_cur_bal open_acc_6m open_il_6m
Min. : 0.000000 Min. : 0 Min. : 0 Min. : 0.0 Min. : 0.0
1st Qu.: 0.000000 1st Qu.: 0 1st Qu.: 29861 1st Qu.: 0.0 1st Qu.: 1.0
Median : 0.000000 Median : 0 Median : 80647 Median : 1.0 Median : 2.0
Mean : 0.005026 Mean : 228 Mean : 139508 Mean : 1.1 Mean : 2.9
3rd Qu.: 0.000000 3rd Qu.: 0 3rd Qu.: 208229 3rd Qu.: 2.0 3rd Qu.: 4.0
Max. :14.000000 Max. :9152545 Max. :8000078 Max. :14.0 Max. :33.0
NA's :25 NA's :63276 NA's :63276 NA's :779525 NA's :779525
open_il_12m open_il_24m mths_since_rcnt_il total_bal_il il_util
Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0 Length:798641
1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 6.0 1st Qu.: 10164 Class :character
Median : 0.0 Median : 1.0 Median : 12.0 Median : 24545 Mode :character
Mean : 0.8 Mean : 1.7 Mean : 21.1 Mean : 36429
3rd Qu.: 1.0 3rd Qu.: 2.0 3rd Qu.: 23.0 3rd Qu.: 47640
Max. :12.0 Max. :19.0 Max. :363.0 Max. :878459
NA's :779525 NA's :779525 NA's :780030 NA's :779525
open_rv_12m open_rv_24m max_bal_bc all_util total_rev_hi_lim
Min. : 0.0 Min. : 0 Min. : 0 Length:798641 Min. : 0
1st Qu.: 0.0 1st Qu.: 1 1st Qu.: 2406 Class :character 1st Qu.: 13900
Median : 1.0 Median : 2 Median : 4502 Mode :character Median : 23700
Mean : 1.4 Mean : 3 Mean : 5878 Mean : 32093
3rd Qu.: 2.0 3rd Qu.: 4 3rd Qu.: 7774 3rd Qu.: 39800
Max. :22.0 Max. :43 Max. :83047 Max. :9999999
NA's :779525 NA's :779525 NA's :779525 NA's :63276
inq_fi total_cu_tl inq_last_12m
Min. : 0.0 Min. : 0.0 Min. :-4
1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0
Median : 0.0 Median : 0.0 Median : 2
Mean : 0.9 Mean : 1.5 Mean : 2
3rd Qu.: 1.0 3rd Qu.: 2.0 3rd Qu.: 3
Max. :16.0 Max. :35.0 Max. :32
NA's :779525 NA's :779525 NA's :779525
my_data <- raw_data
Start cleaning up the columns
A unique LC assigned ID for the loan listing.
Is: integer Should be: integer and non-zero
filter(my_data, is.na(my_data$id)) # all are non-Null
filter(my_data, id == 0) # all are non-zero
filter(my_data, id < 0) # all are non-zero
Is not relevant to decision-making and should be dropped.
A unique LC assigned Id for the borrower member.
is: integer should be: integer, non-Null, non-zero
filter(my_data, is.na(my_data$member_id)) # all are non-Null
filter(my_data, member_id == 0) # all are non-zero
filter(my_data, member_id < 0) # all are positive
This is probably not relevant, but it could be the case that a member has obtained prior loans, and then that could be very relevant for the interest rate on this loan.
This never actually happens. So better to drop it.
The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
Is: integer Should be: integer, positive, non-zero, non-null
Data Exploration: 1. Histogram of the loan_amt in various binwidths
filter(my_data, is.na(my_data$loan_amnt)) # all are non-Null
filter(my_data, loan_amnt == 0) # all are non-zero
filter(my_data, loan_amnt < 0) # all are positive
summary(loan_amnt)
Min. 1st Qu. Median Mean 3rd Qu. Max.
500 8000 13000 14754 20000 35000
ggplot(data = my_data) +
geom_histogram(aes(x = loan_amnt), binwidth = 100)
ggplot(data = my_data) +
geom_histogram(aes(x = loan_amnt), binwidth = 250)
ggplot(data = my_data) +
geom_histogram(aes(x = loan_amnt), binwidth = 500)
ggplot(data = my_data) +
geom_histogram(aes(x = loan_amnt), binwidth = 1000)
The total amount committed to that loan at that point in time.
is: integer should be: integer, non-null, non-negative
clear work done on this column
funded_amnt <-raw_data$funded_amnt
View the data
head(my_data$funded_amnt)
[1] 2600 6700 7000 3000 2525 5075
typeof(my_data$funded_amnt)
[1] "integer"
summary(my_data$funded_amnt)
Min. 1st Qu. Median Mean 3rd Qu. Max.
500 8000 13000 14741 20000 35000
filter(my_data, is.na(funded_amnt)) # no Null
filter(my_data, funded_amnt == 0) # no Zero
filter(my_data, funded_amnt < 0) # no negative
See where funded_amnt greater than, less than, equal to loan_amt
Visualize them together
ggplot(data = my_data) +
geom_point(aes(x = loan_amnt, y = funded_amnt), alpha = 0.05)
The total amount committed by investors for that loan at that point in time. is: character should be: integer (or double?)
clear work done on this column
funded_amnt_inv <-raw_data$funded_amnt_inv
Turn them into integers
head(my_data$funded_amnt_inv)
[1] "575" "6700" "3450" "125" "225" "350"
my_data$funded_amnt_inv <- as.integer(my_data$funded_amnt_inv)
head(my_data$funded_amnt_inv)
[1] 575 6700 3450 125 225 350
See where funded_amnt_inv greater than, less than, equal to loan_amt
ggplot(data = my_data) +
geom_point(aes(x = loan_amnt, y = funded_amnt_inv), alpha = 0.05)
This shows two different curves, indicating two different types of bowers receiving investor funding. What is the difference between the two types of borrowers?
The number of payments on the loan. Values are in months and can be either 36 or 60.
NOTE Gwen says this will not be available in new loan applications, but this makes no sense as there is a distinct difference in interest rates for 36 month versus 60 month loans.
is: character, either ” 36 months” or ” 60 months” should be: integer, either 36 or 60 (to represent number of months). But does it even matter if it’s essentially categorical?
This is similar to a categorical variable. Perhaps it explains the two curves observed in funded_amt_inv. (note, no, it does not appear to after analysis)
clear work done on this column
head(term)
[1] " 36 months" " 36 months" " 36 months" " 36 months" " 36 months" " 36 months"
Use this if keeping it as a string
my_data$term <- str_trim(my_data$term)
head(my_data$term)
[1] "36 months" "36 months" "36 months" "36 months" "36 months" "36 months"
visualizing the data
head(term)
[1] " 36 months" " 36 months" " 36 months" " 36 months" " 36 months" " 36 months"
There are about twice as many 36 month loans as 60 month.
Let’s plot the funded_amnt_inv filtering by 36 month and 60 month
This looks a lot like the other graph with no filtering.
This shows three curves. First, there is an upper curve where it begins linearly. Second, there is a discontinuity where the slope drastically changes towards 0 (and possibly appears to be upward curving?) Third, there is the lower curve, which appears more linear than those for 36 month terms, but still downward curving overall.
Visualize term versus loan amount
ggplot() +
geom_boxplot(mapping = aes(x = term, y = loan_amnt), data = filter(my_data, term == "60 months")) +
geom_boxplot(mapping = aes(x = term, y = loan_amnt), data = filter(my_data, term == "36 months"))
60 month loans are for more money than 36 month loans
ggplot() +
geom_boxplot(mapping = aes(x = term, y = funded_amnt), data = filter(my_data, term == "36 months")) +
geom_boxplot(mapping = aes(x = term, y = funded_amnt), data = filter(my_data, term == "60 months"))
The below provides a worthless graph. Nearly all loans fund regardless of term.
ggplot() +
geom_boxplot(mapping = aes(x = term, y = funded_amnt / loan_amnt), data = filter(my_data, term == "36 months")) +
geom_boxplot(mapping = aes(x = term, y = funded_amnt / loan_amnt), data = filter(my_data, term == "60 months"))
Interest Rate on the loan
is: character should be: float
head(my_data$int_rate)
[1] "8.38" "7.75" "7.75" "9.01" "9.33" "10.28"
my_data$int_rate <- as.double(my_data$int_rate)
head(my_data$int_rate)
[1] 8.38 7.75 7.75 9.01 9.33 10.28
ggplot(my_data) +
geom_histogram(aes(x = int_rate), binwidth = .1)
ggplot(my_data) +
geom_histogram(aes(x = int_rate), binwidth = .25)
ggplot(my_data) +
geom_histogram(aes(x = int_rate), binwidth = .5)
ggplot(my_data) +
geom_histogram(aes(x = int_rate), binwidth = 1)
Interest rate versus loan amount 60 month loans have a higher starting interest rate, and there are very few under 10,000 loan amount
ggplot(data = my_data, mapping = aes(x = loan_amnt, y = int_rate, color = term)) +
geom_point(alpha = .05)
The monthly payment owed by the borrower if the loan originates.
Note Gwen says this will not be available in the new data.
This should be directly correlated with the loan amount and interest rate and determined by term. It may be useful to use these values instead of those when say, comparing to income, in order to avoid calculation.
Is: string Should be: double
head(my_data$installment)
[1] "81.94" "209.18" "218.55" "95.42" "80.69" "164.42"
my_data$installment <- as.double(my_data$installment)
head(my_data$installment)
[1] 81.94 209.18 218.55 95.42 80.69 164.42
The job title supplied by the Borrower when applying for the loan. Employer Title replaces Employer Name for all loans listed after 9/23/2013
This is a string entered by the user and should be dropped.
my_data <- subset(my_data, select = -emp_title)
Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
Is: string Should be: Integer
head(my_data$emp_length)
[1] 3 0 0 0 0 0
Search NA
More than 40,000 rows have no employment information. That seems like a lot to drop. How should we handle them?
The home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER.
Is: String Should be: Perhaps also an ordered set of integers (“OTHER”, “RENT”, “MORTGAGE”, “OWN”). Could also make dummy columns, but this is my order if I were assessing borrowers. We should compare both models.
head(my_data$home_ownership)
[1] "MORTGAGE" "NONE" "NONE" "RENT" "RENT" "RENT"
my_data$home_ownership <- as.integer(ordered(my_data$home_ownership, levels = c("OTHER", "RENT", "MORTGAGE", "OWN")))
head(my_data$home_ownership)
[1] 3 NA NA 2 2 2
Drop nulls. There are very few (47)
filter(my_data, is.na(my_data$home_ownership))
my_data <- filter(my_data, ! is.na(my_data$home_ownership))
filter(my_data, is.na(my_data$home_ownership))
The self-reported annual income provided by the borrower during registration.
Is: string should be: Int
Filter for null, zero, negative
filter(my_data, is.na(my_data$annual_inc)) # four items are non-Null, drop them.
filter(my_data, my_data$annual_inc == 0) # two are zero income, but significant income from joint applicant, so don't drop them.
filter(my_data, my_data$annual_inc < 10000) # 460 applicants.
filter(my_data, my_data$annual_inc < 500) # only the two 0 income applicants
filter(my_data, my_data$annual_inc < 0) # all are positive
Drop NA
my_data <- filter(my_data, ! is.na(annual_inc))
head(my_data$annual_inc)
[1] 6500 35000 110000 95000 150000 59000
ggplot(data = filter(my_data, annual_inc > 1000000), aes(x = annual_inc, y = int_rate)) +
geom_point(aes(color = verification_status))
We should divide income over $1mil by 100 because presumably some of these people entered dollars and cents, but the system did not recognize that they were entering cents.
Needs a decision on the limit of income, where we will assume incorrect data entry. I assume $1,000,000 should have been $10,000.00, but it could be higher or lower.
my_data$annual_inc <- ifelse(my_data$annual_inc >= 1000000, my_data$annual_inc / 100, my_data$annual_inc)
filter(my_data, annual_inc > 1000000)
summary(my_data$annual_inc)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 45000 64999 74559 90000 999999
is_inc_v in the dictionary Indicates if income was verified by LC, not verified, or if the income source was verified
make it an integer, with not verified at 0 in order to show the gain from verified
levels(my_data$verification_status)
NULL
no NA
Does verification have a benefit? Yes
lm.fit <- lm(int_rate~verification_status, my_data)
summary(lm.fit)
Call:
lm(formula = int_rate ~ verification_status, data = my_data)
Residuals:
Min 1Q Median 3Q Max
-9.2762 -3.2179 -0.2179 2.6738 17.1704
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.819567 0.007769 1521.4 <2e-16 ***
verification_status 1.388327 0.005991 231.8 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.242 on 798592 degrees of freedom
Multiple R-squared: 0.06302, Adjusted R-squared: 0.06302
F-statistic: 5.371e+04 on 1 and 798592 DF, p-value: < 2.2e-16
ggplot(data = filter(my_data, verification_status == 0), aes(x = verification_status, y = int_rate)) +
geom_boxplot()
ggplot(data = filter(my_data, verification_status == 1), aes(x = verification_status, y = int_rate)) +
geom_boxplot()
ggplot(data = filter(my_data, verification_status == 2), aes(x = verification_status, y = int_rate)) +
geom_boxplot()
This is surprising that people with verified income have a higher interest rate than people with an unverified income.
The month which the loan was funded
Note Gwen says this will not be in new data.
Is: string Should be:
Is this useful? Probably would imagine that interest rates change.
Added the date as 01 in order to convert to a date.
head(my_data$issue_d)
[1] "Jun-2007" "Jun-2007" "Jun-2007" "Jul-2007" "Jul-2007" "Jul-2013"
f <- my_data$issue_d
for (n in 1:length(f)){
f[n] <- paste("01-", f[n], sep = "")
}
my_data$issue_d <- as.Date(f, format = "%d-%b-%Y")
head(my_data$issue_d)
[1] "2007-06-01" "2007-06-01" "2007-06-01" "2007-07-01" "2007-07-01" "2013-07-01"
See interest rate over time. I need to make a mean interest rate for each month, or maybe filter by term, to make this more helpful.
ggplot(my_data, aes(x = issue_d, y = int_rate)) +
geom_line()
Current status of the loan
This is for after the loan is issued, so I don’t think it will be useful in predicting interest rate.
head(my_data$loan_status)
[1] "Does not meet the credit policy. Status:Charged Off"
[2] "Does not meet the credit policy. Status:Fully Paid"
[3] "Does not meet the credit policy. Status:Fully Paid"
[4] "Does not meet the credit policy. Status:Fully Paid"
[5] "Does not meet the credit policy. Status:Fully Paid"
[6] "Charged Off"
This applies only to current loans and should be deleted
my_data <- subset(my_data, select = -loan_status)
Indicates if a payment plan has been put in place for the loan
This applies only to current, defaulted loans and should be deleted
my_data <- subset(my_data, select = -pymnt_plan)
URL for the LC page with listing data.
This information is not relevant to analysis and should be dropped.
my_data <- subset(my_data, select = -url)
Loan description provided by the borrower
This information could be important but would be impossible to analyse since it is user-entered.
my_data <- subset(my_data, select = -desc)
A category provided by the borrower for the loan request.
Is: String Should be: Dummy columns.
The answer here is relevant for the interest rate.
levels(factor(my_data$purpose))
[1] "car" "credit_card" "debt_consolidation" "educational"
[5] "home_improvement" "house" "major_purchase" "medical"
[9] "moving" "other" "renewable_energy" "small_business"
[13] "vacation" "wedding"
ggplot(my_data, aes(x = purpose, y = int_rate)) +
geom_boxplot()
Should I remove the first dummy to guard against multicolliniarity?
Need to see whether each category has a good number of observations.
The loan title provided by the borrower
This information could be important but would be impossible to analyse since it is user-entered.
my_data <- subset(my_data, select = -title)
The first 3 numbers of the zip code provided by the borrower in the loan application.
This information would indicate where in the country the borrower is located. It might be a relevant variable, but it would at best be an approximation of a borrower’s income. I do not think it provides significant information.
my_data <- subset(my_data, select = -zip_code)
The state provided by the borrower in the loan application
This information would indicate where in the country the borrower is located. It might be a relevant variable, but it would at best be an approximation of a borrower’s income. I do not think it provides significant information.
my_data <- subset(my_data, select = -addr_state)
A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
Is: string Should be: double
filter(my_data, is.na(dti)) # no Null
filter(my_data, dti == 0) # 403 are zero, but this indicates no debt. Seems odd they turn to Lending Club for a loan if this is true.
filter(my_data, dti < 1)
filter(my_data, dti < 0) # no negative
head(my_data$dti)
[1] "6.46" "10" "10" "10" "5.6" "24.55"
f <- as.double(my_data$dti)
head(f)
[1] 6.46 10.00 10.00 10.00 5.60 24.55
The number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years
Is: Integer Should be: Integer
typeof(my_data$delinq_2yrs)
[1] "integer"
summary(my_data$delinq_2yrs)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0000 0.0000 0.0000 0.3145 0.0000 39.0000 21
There are 21 NAs. Drop them.
my_data <- filter(my_data, ! is.na(my_data$delinq_2yrs))
summary(my_data$delinq_2yrs)
The month the borrower’s earliest reported credit line was opened
filter(raw_data, is.na(earliest_cr_line)) # no NA
This could be a placeholder for the borrower’s age. Is it correlated with interest rate? I would imagine that if this date is old, the person is old, so has a higher interest rate, and if young, low income, so higher interest rate. But this would simply be because of my expectations of income by age, and this data does not directly tell us anything.
Added the date as 01 in order to convert to a date.
head(my_data$earliest_cr_line)
[1] "Mar-1984" "Dec-1994" "Jul-1993" "Jan-1983" "Oct-1996" "Dec-1997"
f <- my_data$earliest_cr_line
for (n in 1:length(f)){
f[n] <- paste("01-", f[n], sep = "")
}
my_data$earliest_cr_line <- as.Date(f, format = "%d-%b-%Y")
head(my_data$earliest_cr_line)
[1] "1984-03-01" "1994-12-01" "1993-07-01" "1983-01-01" "1996-10-01" "1997-12-01"
Drop the data as it is not really useful.
my_data <- subset(my_data, select = -earliest_cr_line)
The number of inquiries in past 6 months (excluding auto and mortgage inquiries)
This represents how many times a lender has pulled the borrower’s credit report. In itself it does not mean anything, but a higher value is taken as an indication that the borrower is not creditworthy (else, why ask for so much debt?) 0-2 is fine.
There are no NA values
Plot the range
ggplot(data = my_data, aes(x = inq_last_6mths)) +
geom_histogram(binwidth = 1)
Power law distribution.
summary(lm.1)
Call:
lm(formula = int_rate ~ inq_last_6mths, data = my_data)
Residuals:
Min 1Q Median 3Q Max
-31.3484 -3.3794 -0.2518 2.7606 16.4406
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.549391 0.005818 2157.0 <2e-16 ***
inq_last_6mths 1.001219 0.004785 209.2 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.267 on 798571 degrees of freedom
Multiple R-squared: 0.05197, Adjusted R-squared: 0.05197
F-statistic: 4.378e+04 on 1 and 798571 DF, p-value: < 2.2e-16
This has a significant impact on interest rate.
The number of months since the borrower’s last delinquency.
This is the time that has passed since a buyer was delinquent. Approximately half of the entries are NA. HOWEVER if the borrower has never been delinquent (that is, always pays on time), it SHOULD be NA. NA is a better value than having anything here. BUT the data shows this is not actually true, and those with NA have a higher interest rate.
Possible solutions: very, very high value to replace NA?
ggplot(data = my_data, aes(x = mths_since_last_delinq, y = int_rate)) +
geom_point(alpha = 0.05)
There is a hard cutoff at 84 months because those delinquencies are no longer reported on a credit report (7 years). There are still some observations beyond that time period, but I am not sure why. They may pre-date the change in the law (I recall this was sometime around 2009-2010) or loopholes that allow continuing reporting, or self-reporting to LendingClub.
We need to figure out a way to deal with the <84, >=84 problem.
count(my_data, mths_since_last_delinq == 83)
count(my_data, mths_since_last_delinq == 84)
count(my_data, mths_since_last_delinq == 85)
NA
Those with an NA have a higher interest rate, which I would not expect.
ggplot(data = my_data) +
geom_boxplot(mapping = aes(x = is.na(mths_since_last_delinq), y = int_rate))
The number of months since the last public record.
This would be the months since the last public record filing, which means a lawsuit to collect a debt. From my old credit report from 2015: “Public record information includes bankruptcies, liens or judgments and comes from federal, state or county court records.” Today, they include only bankruptcies, but this is not relevant.
This is very, very bad, and much better if it is NA (i.e. there are none)
levels(factor(my_data$mths_since_last_record))
[1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
[16] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29"
[31] "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44"
[46] "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56" "57" "58" "59"
[61] "60" "61" "62" "63" "64" "65" "66" "67" "68" "69" "70" "71" "72" "73" "74"
[76] "75" "76" "77" "78" "79" "80" "81" "82" "83" "84" "85" "86" "87" "88" "89"
[91] "90" "91" "92" "93" "94" "95" "96" "97" "98" "99" "100" "101" "102" "103" "104"
[106] "105" "106" "107" "108" "109" "110" "111" "112" "113" "114" "115" "116" "117" "118" "119"
[121] "120" "121" "129"
count(my_data, is.na(mths_since_last_record))
ggplot(data = my_data, aes(x = mths_since_last_record, y = int_rate)) +
geom_point(alpha = 0.05)
I do not see a pattern appear.
They also remain on your credit report for ten years during this time period (this is no longer true, but not relevant)
count(my_data, mths_since_last_record == 119)
count(my_data, mths_since_last_record == 120)
NA
Thos with no public records have higher interest rates than those who do, which I would not expect.
ggplot(data = my_data) +
geom_boxplot(mapping = aes(x = is.na(mths_since_last_record), y = int_rate))
The number of open credit lines in the borrower’s credit file.
Is: Integer Should be: Integer
I would expect this to be generally on the high side for users of this platform.
typeof(my_data$open_acc)
[1] "integer"
levels(factor(my_data$open_acc))
[1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17"
[19] "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33" "34" "35"
[37] "36" "37" "38" "39" "40" "41" "42" "43" "44" "45" "46" "47" "48" "49" "50" "51" "52" "53"
[55] "54" "55" "56" "57" "58" "59" "60" "61" "62" "63" "64" "65" "66" "67" "68" "75" "79" "82"
[73] "84" "90"
count(my_data, is.na(open_acc)) # no NA
ggplot(data = my_data, aes(x = open_acc)) +
geom_histogram()
summary(my_data$open_acc)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 8.00 11.00 11.55 14.00 90.00
Is this associated with int_rate?
lm.2 <- lm(int_rate~open_acc, my_data)
summary(lm.2)
Call:
lm(formula = int_rate ~ open_acc, data = my_data)
Residuals:
Min 1Q Median 3Q Max
-8.018 -3.277 -0.251 2.941 15.995
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.3463730 0.0117312 1137.684 <2e-16 ***
open_acc -0.0087796 0.0009227 -9.515 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.382 on 798571 degrees of freedom
Multiple R-squared: 0.0001134, Adjusted R-squared: 0.0001121
F-statistic: 90.53 on 1 and 798571 DF, p-value: < 2.2e-16
This says the interest rate goes down for people with more open accounts and that this is significant, which seems odd. The relationship is probably non-linear, if significant.
Number of derogatory public records
Is: Integer Should be: Integer
Histogram shows very, very strongly power law.
ggplot(data = my_data, aes(x = pub_rec)) +
geom_histogram()
summary(my_data$pub_rec)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.1953 0.0000 63.0000
Total credit revolving balance
This is the revolving debt (credit card, usually) of the borrower. It will be a part of the dti calculation.
Two are na. One has a revol_util figure, so I expect this is a data error. The other has a revol_util of 0, so this may be showing a revol_bal == 0. Since it is only two observations, we should drop them.
Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
This is a ratio showing how maxed out the borrower’s credit cards are. We would expect creditworthy borrowers to have a low ratio.
Is: string Should be: double
ggplot(data = my_data, aes(x = revol_util, y = int_rate)) +
geom_point(alpha = 0.05)
Has a fairly normal distribution.
The total number of credit lines currently in the borrower’s credit file
I do not expect this to be useful to the analysis. It does not distinguish between open or closed accounts. In general not a good predictor.
Is: integer Should be: integer
head(my_data$total_acc)
[1] 16 19 43 28 20 35
The initial listing status of the loan. Possible values are – W, F
“Lending Club reserves a few loans for 12 hours and offers them to the institutional and large retail lenders who want to lend the whole amount for a loan. I am not sure whether the historical loan data file includes the loans that were offered and picked up by lenders as ‘whole’ loans. But, the loans that were initially offered as whole, designated with ‘w’, but not picked up as ‘whole’ loans are listed in historical loan data file.” https://andirog.blogspot.com/2013/04/lending-club-borrowers-income.html
This may be useful because those loans taken by institutional and large lenders may be the most preferred loans by lenders.
head(my_data$initial_list_status)
[1] "f" "f" "f" "f" "f" "f"
Is there a difference?
w loans tend to have a lower interest rate.
Make it an integer, with f == 0 and w == 1, but this could instead be dummy columns
head(my_data$initial_list_status)
[1] "f" "f" "f" "f" "f" "f"
my_data$initial_list_status <- as.integer(factor(my_data$initial_list_status)) - 1
head(my_data$initial_list_status)
[1] 0 0 0 0 0 0
Is this useful?
lm.3 <- lm(int_rate~initial_list_status, my_data)
summary(lm.3)
Call:
lm(formula = int_rate ~ initial_list_status, data = my_data)
Residuals:
Min 1Q Median 3Q Max
-8.4143 -3.5559 -0.2359 2.8557 16.2641
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.734260 0.006789 2022.9 <2e-16 ***
initial_list_status -1.008355 0.009747 -103.5 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.353 on 798569 degrees of freedom
Multiple R-squared: 0.01323, Adjusted R-squared: 0.01322
F-statistic: 1.07e+04 on 1 and 798569 DF, p-value: < 2.2e-16
Remaining outstanding principal for total amount funded
This is only relevant after a loan is issued and should be dropped.
my_data <- subset(my_data, select = -out_prncp)
Error in eval(substitute(select), nl, parent.frame()) :
object 'out_prncp' not found
Remaining outstanding principal for portion of total amount funded by investors
This is only relevant after a loan is issued and should be dropped.
my_data <- subset(my_data, select = -out_prncp_inv)
Payments received to date for total amount funded
This is only relevant after a loan is issued and should be dropped.
my_data <- subset(my_data, select = -total_pymnt)
Payments received to date for portion of total amount funded by investors
This is only relevant after a loan is issued and should be dropped.
my_data <- subset(my_data, select = -total_pymnt_inv)
head(my_data$dti_joint)
f <- as.double(my_data$dti_joint)
summary(f)